Overview
Techniques covered
- We collect data to answer questions
- The first step is to describe and visualise patterns in our data
- Common descriptions include measures of:
  - central tendency and spread of continuous variables (and of differences in these values)
  - the frequency of categorical responses
  - the relationships between variables (correlation)
- After that, we can quantify our confidence (see next workshop)
- Visualisation is an undervalued method for understanding data
An important task for researchers is to answer questions using data. We can often divide this activity into:
- Describing patterns in the data
- Quantifying how sure we are about those patterns
This session is all about describing and visualising patterns in the data to answer research questions.
In this session we will cover four techniques psychologists use to answer research questions.
- summarising numeric variables by central tendency and spread
- calculating the frequency of categorical responses
- calculating differences between scores or groups
- describing the relationship between two variables
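As a rough sketch of what each of the four techniques can look like in R, the code below uses mtcars, a dataset built into R. It is chosen here only for illustration and is not part of this session's materials. The variable names (centre_spread, freq, by_group, correlation) are arbitrary.

```r
library(dplyr)

# mtcars is built into R; it stands in for real study data in this sketch

# 1. central tendency and spread of a numeric variable
centre_spread <- mtcars %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
centre_spread

# 2. frequency of a categorical response (number of cars per cylinder count)
freq <- mtcars %>%
  count(cyl)
freq

# 3. difference between groups (mean mpg for automatic, am = 0, vs manual, am = 1)
by_group <- mtcars %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg))
by_group

# 4. relationship between two variables (correlation of car weight and mpg)
correlation <- mtcars %>%
  summarise(r = cor(wt, mpg))
correlation
```

Each step returns a small table, so the same pattern of piping a dataset into a summarising function covers all four techniques.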
In the past psychologists have often neglected the first part (spotting and describing patterns) and jumped straight to the second — for example, they have been very keen to run hypothesis tests and calculate p values…
More recently, researchers have placed much more emphasis on describing and visualising the data — to really get a feel for the patterns they see — before trying to quantify the evidence it provides or make inferences from it.
We have already seen how to implement some of these techniques in R (e.g. using summarise()) and with ggplot.
However, more important than any specific technique in R is the idea that we collect data to answer questions (not just for its own sake!).
Central tendency and spread
- the central tendency of the data describes the “middle” of a set of values
- the mean and median are the most common measures
- measures of spread or distribution of the data show where most of the values fall (i.e. what range of values are most likely)
- common measures are the standard deviation or interquartile range
- we have already seen how to calculate these statistics using summarise() and group_by()
# this is a recap of earlier material
# calculate an average
# typical weight at baseline in the FIT trial
funimagery %>%
  summarise(mean(kg1))
  mean(kg1)
1  90.70536

# boxplots show the interquartile range (IQR) as the height of the box.
# The IQR is the range which includes the middle 50% of the data points
funimagery %>%
  ggplot(aes(intervention, kg1)) +
  geom_boxplot() +
  scale_y_continuous(n.breaks = 10) # this extra line just adds more marks on the y-axis

If we have a single continuous variable (that is, one stored in a numeric column in R) then we can describe a few things about it, including:
- the central tendency of the data: e.g. mean, median (see here for refresher)
- the spread or distribution of the data: e.g. the standard deviation or interquartile range (see refresher here)
It’s important to remember that even simple descriptive statistics like the mean or standard deviation enable us to answer research questions — you don’t always need fancy statistics! For example, if we consider the funimagery data describing the RCT of functional imagery training, we could ask:
- “what was the typical weight of participants at baseline?” or
- “what was the range in which most participants’ weight fell?”
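A minimal sketch of how these two questions could be answered with summarise() is shown below. It does not load the real funimagery data. Instead it uses the first ten baseline weights (kg1) copied from the printed output later in this session, so the numbers it produces describe only those ten people. The column names in the summary (mean_kg, lower_quartile and so on) are made up for this example.

```r
library(dplyr)

# first ten kg1 values copied from the funimagery output shown later in
# this session; this is an illustrative subset, not the full dataset
baseline <- data.frame(
  kg1 = c(107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0, 87.0, 121.2)
)

baseline_summary <- baseline %>%
  summarise(
    mean_kg = mean(kg1),                    # "typical" weight (mean)
    median_kg = median(kg1),                # "typical" weight (median)
    sd_kg = sd(kg1),                        # spread: standard deviation
    lower_quartile = quantile(kg1, 0.25),   # bottom of the boxplot box
    upper_quartile = quantile(kg1, 0.75),   # top of the boxplot box
    iqr_kg = IQR(kg1)                       # height of the boxplot box
  )
baseline_summary
```

The lower and upper quartiles give the range containing the middle 50% of values, which is the same range a boxplot draws as its box.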
In part, you have already seen how techniques like group_by() and summarise(), or graphs like boxplots, can help calculate and present these descriptive statistics.
# (this is a recap of earlier material)
# typical weight at baseline
funimagery %>%
  summarise(mean(kg1))
  mean(kg1)
1  90.70536

# a boxplot showing the IQR as the box. The IQR includes the middle 50% of participants
# so, we can see 50% of participants weighed between roughly 80 and 100kg at baseline
funimagery %>%
  ggplot(aes(intervention, kg1)) +
  geom_boxplot() +
  scale_y_continuous(n.breaks = 10) # this extra line just adds more marks on the y-axis

Describing differences
The previous table and boxplot showed patients’ weights at the start of the study.
There is also a variable in this dataset called weight_lost_end_trt, which shows how much weight people lost between starting and completing FIT or MI. In a previous session we made a boxplot like this:
However, in clinical trials, it’s important to measure participants for longer periods to judge whether the effect of a treatment is sustained.
Interventions for obesity and overweight can be successful, but patients may later regain weight (Hall & Kahan, 2018). Estimating how long weight loss is sustained matters because it changes patients’ long-term prognosis, and therefore how cost-effective an intervention is.
The funimagery data come from a study which followed people for 6 months after completing treatment (12 months after joining the study). The kg1 column records weights at baseline, and the kg3 column records observations made at the end of follow-up.
This means we can calculate weight loss from baseline to follow-up (not just the end of treatment, which has already been done for us).
To do this we need to create a new column in our dataset. Let’s call this weight_lost_end_followup.
To calculate this new column we need to subtract weight at baseline (kg1) from weight at the end of follow-up (kg3).
In R, we can do this with the mutate function:
# use `mutate` to create a NEW COLUMN of data
# this code shows the result just below the code chunk
funimagery %>%
  mutate(weight_lost_end_followup = kg3 - kg1)
gender age kg1 kg2 kg3 person intervention weight_lost_end_trt weight_lost_end_followup
1 f 44 107.8 106.7 106.0 4 MI -1.1 -1.8
2 f 32 107.0 105.4 105.9 5 MI -1.6 -1.1
3 f 33 99.5 101.0 98.8 6 MI 1.5 -0.7
4 f 21 80.0 79.0 78.0 7 MI -1.0 -2.0
5 f 27 81.0 80.0 80.0 8 MI -1.0 -1.0
6 f 56 59.0 57.0 60.0 9 MI -2.0 1.0
7 f 50 95.0 92.0 92.0 10 MI -3.0 -3.0
8 m 57 90.0 87.0 87.0 11 MI -3.0 -3.0
9 f 34 87.0 87.0 86.4 12 MI 0.0 -0.6
10 f 25 121.2 123.0 119.7 13 MI 1.8 -1.5
11 m 70 84.0 81.9 84.0 14 MI -2.1 0.0
12 f 56 100.0 97.2 98.0 15 MI -2.8 -2.0
13 f 55 87.4 85.0 84.9 16 MI -2.4 -2.5
14 f 43 100.4 99.1 100.0 17 MI -1.3 -0.4
15 m 37 89.5 91.0 91.5 18 MI 1.5 2.0
16 m 45 84.2 85.6 86.2 19 MI 1.4 2.0
17 f 60 92.3 88.0 88.0 20 MI -4.3 -4.3
18 f 51 112.0 109.0 109.0 21 MI -3.0 -3.0
19 f 21 76.0 75.0 76.2 22 MI -1.0 0.2
20 m 43 89.5 87.2 88.0 23 MI -2.3 -1.5
21 f 21 118.0 116.7 115.0 24 MI -1.3 -3.0
22 m 19 102.4 99.4 100.0 25 MI -3.0 -2.4
23 m 65 79.4 80.2 79.5 26 MI 0.8 0.1
24 m 40 88.0 85.9 87.8 27 MI -2.1 -0.2
25 f 45 85.0 83.0 80.0 28 MI -2.0 -5.0
26 f 66 67.5 65.4 65.6 29 MI -2.1 -1.9
27 f 38 91.3 91.5 89.0 30 MI 0.2 -2.3
28 f 55 88.3 85.0 87.1 31 MI -3.3 -1.2
29 f 52 86.2 85.0 85.0 32 MI -1.2 -1.2
30 f 41 77.4 76.7 80.1 33 MI -0.7 2.7
31 m 40 79.0 76.5 80.0 34 MI -2.5 1.0
32 m 44 99.0 98.2 95.0 35 MI -0.8 -4.0
33 f 51 87.0 87.0 85.0 36 MI 0.0 -2.0
34 f 56 85.0 82.3 85.2 37 MI -2.7 0.2
35 m 40 69.0 66.1 67.0 38 MI -2.9 -2.0
36 f 30 119.0 127.0 123.0 39 MI 8.0 4.0
37 m 51 103.5 104.7 103.0 40 MI 1.2 -0.5
38 m 35 73.7 71.6 71.2 41 MI -2.1 -2.5
39 f 24 87.0 84.7 86.0 42 MI -2.3 -1.0
40 f 36 131.3 132.0 132.0 43 MI 0.7 0.7
41 f 47 120.0 120.0 122.0 44 MI 0.0 2.0
42 f 41 87.5 86.2 85.0 45 MI -1.3 -2.5
43 f 54 90.0 87.7 86.0 46 MI -2.3 -4.0
44 f 20 90.0 88.0 88.0 47 MI -2.0 -2.0
45 f 60 87.0 85.2 84.0 48 MI -1.8 -3.0
46 f 33 83.0 81.0 80.0 49 MI -2.0 -3.0
47 f 23 72.0 71.2 71.0 50 MI -0.8 -1.0
48 f 50 75.0 72.0 70.0 51 MI -3.0 -5.0
49 f 45 69.0 69.0 68.0 52 MI 0.0 -1.0
50 f 34 85.4 78.8 74.7 53 MI -6.6 -10.7
51 f 35 76.5 76.3 78.0 54 MI -0.2 1.5
52 m 56 74.1 70.8 72.0 55 MI -3.3 -2.1
53 f 33 93.4 94.5 93.5 58 MI 1.1 0.1
54 f 56 82.1 76.0 74.0 59 FIT -6.1 -8.1
55 f 20 90.5 88.0 84.5 60 FIT -2.5 -6.0
56 f 60 120.4 109.2 98.0 61 FIT -11.2 -22.4
57 f 59 97.2 94.6 90.0 62 FIT -2.6 -7.2
58 f 45 78.0 74.8 68.0 63 FIT -3.2 -10.0
59 f 39 79.9 75.8 68.5 64 FIT -4.1 -11.4
60 m 40 101.2 98.7 95.3 65 FIT -2.5 -5.9
61 f 25 111.0 100.0 90.0 66 FIT -11.0 -21.0
62 m 28 79.6 83.8 86.4 67 FIT 4.2 6.8
63 m 22 62.3 63.4 60.0 68 FIT 1.1 -2.3
64 f 26 76.8 70.0 65.0 69 FIT -6.8 -11.8
65 m 42 95.9 97.9 85.0 70 FIT 2.0 -10.9
66 f 35 84.7 82.0 80.0 71 FIT -2.7 -4.7
67 m 70 116.4 111.8 113.5 72 FIT -4.6 -2.9
68 f 46 100.2 94.0 90.0 73 FIT -6.2 -10.2
69 m 22 140.5 138.7 146.8 74 FIT -1.8 6.3
70 f 46 91.7 79.3 75.2 75 FIT -12.4 -16.5
71 f 60 85.4 80.2 70.0 76 FIT -5.2 -15.4
72 f 43 86.7 85.7 80.0 77 FIT -1.0 -6.7
73 f 58 112.6 95.0 98.5 78 FIT -17.6 -14.1
74 m 40 113.9 104.0 95.0 79 FIT -9.9 -18.9
75 f 60 75.5 70.0 70.0 80 FIT -5.5 -5.5
76 f 23 99.5 96.5 90.0 81 FIT -3.0 -9.5
77 f 55 101.5 96.1 97.0 82 FIT -5.4 -4.5
78 m 27 95.9 87.2 84.1 83 FIT -8.7 -11.8
79 f 33 94.9 90.9 92.0 84 FIT -4.0 -2.9
80 f 40 83.5 74.0 73.9 85 FIT -9.5 -9.6
81 f 50 107.0 104.4 95.0 86 FIT -2.6 -12.0
82 f 53 77.8 73.2 78.1 87 FIT -4.6 0.3
83 f 69 89.8 86.9 80.0 88 FIT -2.9 -9.8
84 m 48 88.9 85.2 85.0 89 FIT -3.7 -3.9
85 f 58 103.9 98.7 97.0 90 FIT -5.2 -6.9
86 m 35 64.0 59.0 58.0 91 FIT -5.0 -6.0
87 f 53 74.0 70.0 65.0 92 FIT -4.0 -9.0
88 m 36 98.3 89.0 94.0 93 FIT -9.3 -4.3
89 f 24 88.4 82.0 75.0 94 FIT -6.4 -13.4
90 f 46 88.6 84.6 85.0 95 FIT -4.0 -3.6
91 f 20 94.9 90.1 90.0 96 FIT -4.8 -4.9
92 f 62 76.3 72.1 72.0 97 FIT -4.2 -4.3
93 f 51 110.0 100.5 99.0 98 FIT -9.5 -11.0
94 f 42 82.0 78.2 70.0 99 FIT -3.8 -12.0
95 f 44 103.0 87.4 79.0 100 FIT -15.6 -24.0
96 f 23 95.9 93.6 93.2 101 FIT -2.3 -2.7
97 f 20 95.9 87.2 94.2 102 FIT -8.7 -1.7
98 f 56 90.4 79.0 85.0 103 FIT -11.4 -5.4
99 f 42 68.0 68.0 73.8 104 FIT 0.0 5.8
100 f 35 125.9 122.3 124.5 105 FIT -3.6 -1.4
101 f 64 82.4 77.8 81.8 106 FIT -4.6 -0.6
102 f 29 70.8 63.9 60.0 107 FIT -6.9 -10.8
103 f 72 73.3 73.5 73.4 108 FIT 0.2 0.1
104 m 48 78.7 79.5 70.0 109 FIT 0.8 -8.7
105 m 46 84.0 80.0 75.3 110 FIT -4.0 -8.7
106 f 66 72.6 68.0 64.0 111 FIT -4.6 -8.6
107 f 69 89.4 83.6 86.0 112 FIT -5.8 -3.4
108 f 64 121.4 114.0 114.0 113 FIT -7.4 -7.4
109 m 39 101.8 97.1 94.0 114 FIT -4.7 -7.8
110 m 50 80.7 75.2 74.1 115 FIT -5.5 -6.6
111 f 54 80.5 84.3 82.0 116 FIT 3.8 1.5
[ reached 'max' / getOption("max.print") -- omitted 1 rows ]
- run the code above and show students the result
- point out that this has not been stored anywhere — just displayed in the RStudio GUI, below the code chunk
mutate() makes a copy of our dataset with a new column added. That is, it always gives us back a new dataset and leaves the original unchanged.
We almost always want to STORE this new copy of the dataset so we can use the new column that was created. To do this we assign the result of mutate() to a new variable (the ‘container’ type of variable).
The assignment operator is the left hand arrow, <-:
funimagery.edited <- funimagery %>%
  mutate(weight_lost_end_followup = kg3 - kg1)
- show how a new variable has been created in the Environment window
So the code above:
- takes the funimagery data and pipes it to the mutate() function, which
- adds a new column, called weight_lost_end_followup
- this new column is made by subtracting kg1 (baseline) from kg3 (end of follow-up); it then
- stores this new copy of the dataset (with the extra column) in a new variable called funimagery.edited
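To see these steps on a scale small enough to check by hand, here is a sketch using only three rows copied from the printed output above (persons 4, 5 and 6). The names weights and with_change are invented for this example and do not appear in the workbook.

```r
library(dplyr)

# three rows copied from the funimagery output above (persons 4, 5 and 6)
weights <- data.frame(
  person = c(4, 5, 6),
  kg1 = c(107.8, 107.0, 99.5),  # baseline weight
  kg3 = c(106.0, 105.9, 98.8)   # weight at end of follow-up
)

# mutate() returns a new data frame with the extra column,
# and <- stores that new copy under a new name
with_change <- weights %>%
  mutate(weight_lost_end_followup = kg3 - kg1)

names(with_change)  # four columns, including weight_lost_end_followup
names(weights)      # the original still has only its three columns

# kg3 - kg1 is negative when a person weighs less at follow-up than at baseline
with_change$weight_lost_end_followup
```

Note the sign: because the column is calculated as kg3 minus kg1, a negative value means weight was lost and a positive value means weight was gained.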
We can then use this new variable, funimagery.edited, to do more work, like making a boxplot:
# boxplot of weight lost at end of follow-up using new column
funimagery.edited %>%
  ggplot(aes(intervention, weight_lost_end_followup)) +
  geom_boxplot()

If anything, it looks like the difference between groups is even BIGGER after follow-up than it was at the end of treatment, which is very promising for FIT.
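One way to put numbers on a comparison like this is to combine group_by() and summarise(). The sketch below is an assumption about how that might look, not part of the original workbook. It uses only ten rows copied from the printed output earlier in this section (the first five MI rows and the first five FIT rows), so its results illustrate the method and say nothing reliable about the full trial.

```r
library(dplyr)

# ten rows copied from the printed output earlier in this section:
# the first five MI rows and the first five FIT rows (not a random sample)
fit_sample <- data.frame(
  intervention = rep(c("MI", "FIT"), each = 5),
  weight_lost_end_trt = c(-1.1, -1.6, 1.5, -1.0, -1.0,
                          -6.1, -2.5, -11.2, -2.6, -3.2),
  weight_lost_end_followup = c(-1.8, -1.1, -0.7, -2.0, -1.0,
                               -8.1, -6.0, -22.4, -7.2, -10.0)
)

# mean change in weight for each group, at end of treatment and at follow-up
group_means <- fit_sample %>%
  group_by(intervention) %>%
  summarise(
    mean_change_end_trt = mean(weight_lost_end_trt),
    mean_change_end_followup = mean(weight_lost_end_followup)
  )
group_means
```

With the real data, the same group_by() and summarise() steps could be applied to funimagery.edited in place of fit_sample.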
Exercise 1
- Open session-4.rmd using the Files pane. This is the workbook you will be using in this session.
- Use group_by() and summarise() with the built-in iris dataset to calculate the mean Sepal.Width for each Species of flower.
- Make a boxplot that shows the sepal width for each species of flower.
These are the correct numbers to check your work against:
| Species | Mean sepal width |
|---|---|
| setosa | 3.4 |
| versicolor | 2.8 |
| virginica | 3.0 |
Your plot should look like this:
The aes part of your ggplot code should be: